{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-colab"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/pathwaycom/pathway/blob/main/examples/notebooks/showcases/fuzzy_join_part2.ipynb\" target=\"_parent\"><img src=\"https://pathway.com/assets/colab-badge.svg\" alt=\"Run In Colab\" class=\"inline\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "notebook-instructions",
      "source": [
        "# Installing Pathway with Python 3.10+\n",
        "\n",
        "In the cell below, we install Pathway into a Python 3.10+ Linux runtime.\n",
        "\n",
        "> **If you are running in Google Colab, please run the colab notebook (Ctrl+F9)**, disregarding the 'not authored by Google' warning.\n",
        "> \n",
        "> **The installation and loading time is less than 1 minute**.\n"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "id": "pip-installation-pathway",
      "source": [
        "%%capture --no-display\n",
        "!pip install --prefer-binary pathway"
      ],
      "execution_count": null,
      "outputs": [],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "id": "logging",
      "source": [
        "import logging\n",
        "\n",
        "logging.basicConfig(level=logging.CRITICAL)"
      ],
      "execution_count": null,
      "outputs": [],
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "id": "2",
      "metadata": {
        "lines_to_next_cell": 0
      },
      "source": [
        "# Interaction with a Feedback Loop.\n",
        "\n",
        "## Fuzzy Join - reconciliation with audit: when the computer is not enough\n",
        "\n",
        "In this article, we are going to show you how Pathway interacts with incremental data flows with a **feedback loop**.\n",
        "\n",
        "In the [another showcase](/developers/templates/fuzzy_join/fuzzy_join_chapter1) we explained how `smart_fuzzy_join` may be helpful in bookkeeping.\n",
        "Previously, we had a simple pipeline that matched entries of two different tables, such as two logs of bank transfers, in two different formats.\n",
        "Many matchings can be inferred automatically, but some can be really tricky without help: while the fans of Harry Potter can instantaneously make the connection between 'You-Know-Who' and 'Voldemort', it is impossible for a computer to do so, at least without help.\n",
        "\n",
        "Human audit is unavoidable in many areas such as accounting or banking.\n",
        "As such, we extend our pipeline with an auditor that supervises the process of reconciliation.\n",
        "The auditor may help the system by providing some hints, i.e. suggesting difficult matchings by hand.\n",
        "\n",
        "## Feedback loop in Pathway\n",
        "![Graph image](https://pathway.com/assets/content/showcases/fuzzy_join/reconciliation_chapter3_trim.png)\n",
        "\n",
        "This figure represents an architecture with a feedback loop to understand how the pieces work together.\n",
        "\n",
        "Reconciliation by SmartFuzzyJoin lies at the heart of the architecture:\n",
        "- it consumes inputs from 3 sources:\n",
        "  - two tables with transactions in different formats;\n",
        "  - a table with manual corrections provided by the auditor;\n",
        "- it outputs one table with matched records.\n",
        "\n",
        "You might think of the auditor as a simple automaton.\n",
        "Either they are satisfied with presented results and simply save them in some storage, or they provide some hints for the algorithm to find a better matching.\n",
        "\n",
        "**Note:** Although the architecture contains a feedback loop, all tables here are either inputs or outputs of the system.\n",
        "\n",
        "\n",
        "## The data\n",
        "\n",
        "Human audit is certainly needed to handle the sample dataset below.\n",
        "\n",
        " **Recipient and sender in a 'standard' CSV format**\n",
        "\n",
        "|id    |recipient|sender       |\n",
        "|------|---------|-------------|\n",
        "|1     |Bill H.  |Nancy R.     |\n",
        "|2     |Harry P. |Hermione  G. |\n",
        "|3     |Julian S.|Dick F.      |\n",
        "\n",
        "\n",
        " **Messages describing the transactions**\n",
        "\n",
        "|id    |message  |\n",
        "|------|---------|\n",
        "|A     |Dear William, thank you for your patience. Regards, Ann|\n",
        "|B     |Dear Colleague! I think they might have sent me your particle! Yours, Richard|\n",
        "|C     |Paying back for two Chocolate Frogs, cheers Hermione!|\n",
        "\n",
        "## Automatic reconciliation\n",
        "Let's see how many records we can match without any human help.\n",
        "We reuse code from [Part 1 of this showcase](/developers/templates/fuzzy_join/fuzzy_join_chapter1)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "3",
      "metadata": {
        "lines_to_next_cell": 2
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "\n",
        "import pathway as pw\n",
        "\n",
        "# To use advanced features with Pathway Scale, get your free license key from\n",
        "# https://pathway.com/features and paste it below.\n",
        "# To use Pathway Community, comment out the line below.\n",
        "pw.set_license_key(\"demo-license-key-with-telemetry\")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "4",
      "metadata": {},
      "source": [
        "We need to read the csv files:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "5",
      "metadata": {
        "lines_to_next_cell": 2
      },
      "outputs": [],
      "source": [
        "# Uncomment to download the required files.\n",
        "# %%capture --no-display\n",
        "# !wget https://public-pathway-releases.s3.eu-central-1.amazonaws.com/data/fuzzy_join_part_2_transactionsA.csv -O transactionsA.csv\n",
        "# !wget https://public-pathway-releases.s3.eu-central-1.amazonaws.com/data/fuzzy_join_part_2_transactionsB.csv -O transactionsB.csv\n",
        "# !wget https://public-pathway-releases.s3.eu-central-1.amazonaws.com/data/fuzzy_join_part_2_audit1-v2.csv -O audit1.csv\n",
        "# !wget https://public-pathway-releases.s3.eu-central-1.amazonaws.com/data/fuzzy_join_part_2_audit2-v2.csv -O audit2.csv"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "6",
      "metadata": {
        "lines_to_next_cell": 2
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "            | key | recipient | sender\n",
            "^YYY4HAB... | 1   | Bill H.   | Nancy R.\n",
            "^Z3QWT29... | 2   | Harry P.  |  Hermione  G.\n",
            "^3CZ78B4... | 3   | Julian S. | Dick F.\n",
            "            | key | message\n",
            "^KHNET5G... | A   | Dear William, thank you for you patience. Regards, Ann\n",
            "^FFN4QBS... | B   | Dear colleague! I think they might have send me your part! Yours, Richard\n",
            "^0K14V55... | C   | Paying back for two Chocolate Frogs, cheers Hermione\n"
          ]
        }
      ],
      "source": [
        "class TransactionsA(pw.Schema):\n",
        "    key: int = pw.column_definition(primary_key=True)\n",
        "    recipient: str\n",
        "    sender: str\n",
        "\n",
        "\n",
        "class TransactionsB(pw.Schema):\n",
        "    key: str = pw.column_definition(primary_key=True)\n",
        "    message: str\n",
        "\n",
        "\n",
        "transactionsA = pw.io.csv.read(\n",
        "    \"./transactionsA.csv\",\n",
        "    schema=TransactionsA,\n",
        "    mode=\"static\",\n",
        ")\n",
        "transactionsB = pw.io.csv.read(\n",
        "    \"./transactionsB.csv\",\n",
        "    schema=TransactionsB,\n",
        "    mode=\"static\",\n",
        ")\n",
        "pw.debug.compute_and_print(transactionsA)\n",
        "pw.debug.compute_and_print(transactionsB)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "7",
      "metadata": {
        "lines_to_next_cell": 2
      },
      "source": [
        "We use the provided column `key` as indexes: Pathway will generate indexes based on those.\n",
        "\n",
        "We add a wrapper `reconcile_transactions` to replace the generated indexes by the corresponding key."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "8",
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "            | left_key | confidence | right_key\n",
            "^VFPBYAD... |          | 0.0        |\n",
            "^VFPEVKP... |          | 0.0        |\n",
            "^QFWX2FF... | 2        | 0.5        | C\n"
          ]
        }
      ],
      "source": [
        "def match_transactions(transactionsA, transactionsB, by_hand_matching):\n",
        "    matching = pw.ml.smart_table_ops.fuzzy_match_tables(\n",
        "        transactionsA, transactionsB, by_hand_match=by_hand_matching\n",
        "    )\n",
        "\n",
        "    transactionsA_reconciled = transactionsA.select(\n",
        "        left=None, right=None, confidence=0.0\n",
        "    ).update_rows(\n",
        "        matching.select(pw.this.left, pw.this.right, confidence=pw.this.weight).with_id(\n",
        "            pw.this.left\n",
        "        )\n",
        "    )\n",
        "    return transactionsA_reconciled\n",
        "\n",
        "\n",
        "def reconcile_transactions(\n",
        "    transactionsA,\n",
        "    transactionsB,\n",
        "    audit=None,\n",
        "):\n",
        "    by_hand_matching = pw.Table.empty(left=pw.Pointer, right=pw.Pointer, weight=float)\n",
        "    if audit is not None:\n",
        "        by_hand_matching = audit\n",
        "        by_hand_matching = by_hand_matching.select(\n",
        "            left=transactionsA.pointer_from(pw.this.left),\n",
        "            right=transactionsB.pointer_from(pw.this.right),\n",
        "            weight=pw.this.weight,\n",
        "        )\n",
        "    transactionsA_reconciled = match_transactions(\n",
        "        transactionsA, transactionsB, by_hand_matching\n",
        "    )\n",
        "    transactionsA_reconciled = transactionsA_reconciled.join_left(\n",
        "        transactionsA, pw.left.left == pw.right.id\n",
        "    ).select(pw.left.right, pw.left.confidence, left_key=pw.right.key)\n",
        "    transactionsA_reconciled = transactionsA_reconciled.join_left(\n",
        "        transactionsB, pw.left.right == pw.right.id\n",
        "    ).select(pw.left.left_key, pw.left.confidence, right_key=pw.right.key)\n",
        "    return transactionsA_reconciled, by_hand_matching\n",
        "\n",
        "\n",
        "matching, _ = reconcile_transactions(transactionsA, transactionsB)\n",
        "pw.debug.compute_and_print(matching)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "9",
      "metadata": {
        "lines_to_next_cell": 0
      },
      "source": [
        "Not a perfect matching. It seems that the help of an auditor is needed.\n",
        "\n",
        "\n",
        "## Incremental reconciliation with an auditor\n",
        "The correct matching is 1 - A, 2 - C and 3 - B. Why? [Tip 1](https://en.wikipedia.org/wiki/Bill_%28given_name%29), [Tip 2](https://www.nobelprize.org/prizes/physics/1965/summary/).\n",
        "\n",
        "Previously, the algorithm identified matching 2 - C correctly but failed to find the connections between the other pairs.\n",
        "Now, we run it with a hint - feedback from an auditor.\n",
        "\n",
        "To include the hint (nothing complicated), we just need to launch our function with the parameter `audit`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "10",
      "metadata": {},
      "outputs": [],
      "source": [
        "class AuditSchema(pw.Schema):\n",
        "    left: int\n",
        "    right: str\n",
        "    weight: float\n",
        "\n",
        "\n",
        "audit = pw.io.csv.read(\"./audit1.csv\", schema=AuditSchema, mode=\"static\")\n",
        "matching, suggested_matchings = reconcile_transactions(\n",
        "    transactionsA, transactionsB, audit\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "11",
      "metadata": {},
      "source": [
        "Here is the author's feedback, the pair 1 - A:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "12",
      "metadata": {
        "lines_to_next_cell": 0
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "            | left        | right       | weight\n",
            "^E4J9BTQ... | ^YYY4HAB... | ^KHNET5G... | 1.0\n"
          ]
        }
      ],
      "source": [
        "pw.debug.compute_and_print(suggested_matchings)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "13",
      "metadata": {
        "lines_to_next_cell": 0
      },
      "source": [
        "Given this feedback, we check that the new matching took into account this pair:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "14",
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "            | left_key | confidence | right_key\n",
            "^VFPBYAD... |          | 0.0        |\n",
            "^G1GM5SW... | 1        | 1.0        | A\n",
            "^QFWX2FF... | 2        | 0.5        | C\n"
          ]
        }
      ],
      "source": [
        "pw.debug.compute_and_print(matching)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "15",
      "metadata": {
        "lines_to_next_cell": 0
      },
      "source": [
        "Still not perfect but better. It seems that more help from the auditor is needed.\n",
        "Now, with one more extra hint the algorithm matches all the records correctly."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "16",
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "            | left | right | weight\n",
            "^T5Q1HY0... | 1    | A     | 1.0\n",
            "^9MF722E... | 3    | B     | 1.0\n"
          ]
        }
      ],
      "source": [
        "audit = pw.io.csv.read(\"./audit2.csv\", schema=AuditSchema, mode=\"static\")\n",
        "pw.debug.compute_and_print(audit)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "17",
      "metadata": {},
      "outputs": [],
      "source": [
        "matching, suggested_matchings = reconcile_transactions(\n",
        "    transactionsA, transactionsB, audit\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "18",
      "metadata": {
        "lines_to_next_cell": 0
      },
      "source": [
        "This time we provide the last pair, 3 - B:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "19",
      "metadata": {
        "lines_to_next_cell": 0
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "            | left        | right       | weight\n",
            "^T5Q1HY0... | ^YYY4HAB... | ^KHNET5G... | 1.0\n",
            "^9MF722E... | ^3CZ78B4... | ^FFN4QBS... | 1.0\n"
          ]
        }
      ],
      "source": [
        "pw.debug.compute_and_print(suggested_matchings)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "20",
      "metadata": {
        "lines_to_next_cell": 0
      },
      "source": [
        "Given those, we should obtain a full --and hopefully correct -- matching."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "21",
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "            | left_key | confidence | right_key\n",
            "^G1GM5SW... | 1        | 1.0        | A\n",
            "^QFWX2FF... | 2        | 0.5        | C\n",
            "^1SQNPMP... | 3        | 1.0        | B\n"
          ]
        }
      ],
      "source": [
        "pw.debug.compute_and_print(matching)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "22",
      "metadata": {},
      "source": [
        "Bingo!\n",
        "\n",
        "It may sound long and tedious but in practice most of the matchings should have been done automatically.\n",
        "This process is only performed for the few remaining cases, where the linkages are hard to make."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "23",
      "metadata": {},
      "source": [
        "## Conclusion\n",
        "In conclusion, writing pipelines with a feedback loop is as easy as can be.\n",
        "When writing such a data processing algorithm, a tip is to always clearly separate inputs from outputs.\n",
        "It is important because the Pathway engine observes inputs for any changes and recalculates parts of the computation when needed.\n",
        "\n",
        "\n",
        "In the next chapter, we will show you how to make a Pathway installation which provides a full Fuzzy-Matching application, complete with frontend. (Coming soon!)\n",
        "\n",
        "![Demo animation](https://pathway.com/assets/content/showcases/fuzzy_join/demo.gif)"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3 (ipykernel)",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.11"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}